Two-Pass Greedy Regular Expression Parsing

Authors

  • Niels Bjørn Bugge Grathwohl
  • Fritz Henglein
  • Lasse Nielsen
  • Ulrik Terp Rasmussen
Abstract

We present new algorithms for producing greedy parses for regular expressions (REs) in a semi-streaming fashion. Our lean-log algorithm executes in time O(mn) for REs of size m and input strings of size n and outputs a compact bit-coded parse tree representation. It improves on previous algorithms by: operating in only 2 passes; using only O(m) words of random-access memory (independent of n); requiring only kn bits of sequentially written and read log storage, where k < m/3 is the number of alternatives and Kleene stars in the RE; processing the input string as a symbol stream and not requiring it to be stored at all. Previous RE parsing algorithms do not scale linearly with input size, or require substantially more log storage and employ 3 passes where the first consists of reversing the input, or do not or are not known to produce a greedy parse. The performance of our unoptimized C-based prototype indicates that the superior performance of our lean-log algorithm can also be observed in practice; it is also surprisingly competitive with RE tools not performing full parsing, such as Grep.
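
To make the phrase "bit-coded parse tree representation" concrete, here is a minimal Python sketch of how such a bit string could be decoded back into the matched text. It is not the lean-log algorithm itself. The tuple-based RE encoding, the `decode` function, and the bit convention (0 or 1 selects the left or right alternative; 0 precedes each Kleene-star iteration and 1 ends the star) are assumptions made for illustration and are not stated in the abstract.

```python
# Hypothetical RE encoding (an assumption for this sketch):
#   ("eps",)  ("lit", c)  ("cat", r1, r2)  ("alt", r1, r2)  ("star", r)

def decode(re, bits, pos=0):
    """Decode a bit-coded parse of `re`; return (matched_text, next_bit_position)."""
    tag = re[0]
    if tag == "eps":
        return "", pos                      # empty string consumes no bits
    if tag == "lit":
        return re[1], pos                   # a literal consumes no bits
    if tag == "cat":
        left, pos = decode(re[1], bits, pos)
        right, pos = decode(re[2], bits, pos)
        return left + right, pos
    if tag == "alt":
        # Assumed convention: one bit picks the branch (0 = left, 1 = right).
        branch = re[1] if bits[pos] == "0" else re[2]
        return decode(branch, bits, pos + 1)
    if tag == "star":
        # Assumed convention: 0 announces another iteration, 1 ends the star.
        parts = []
        while bits[pos] == "0":
            text, pos = decode(re[1], bits, pos + 1)
            parts.append(text)
        return "".join(parts), pos + 1
    raise ValueError(f"unknown RE node: {tag!r}")

# (a|b)* as a tuple tree
RE = ("star", ("alt", ("lit", "a"), ("lit", "b")))

if __name__ == "__main__":
    print(decode(RE, "00011"))
```

Under these assumptions, only alternatives and Kleene stars consume bits, so the code for a parse stays compact: for `(a|b)*` and the input `ab`, the five bits `00011` determine the whole parse tree.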

Similar resources

Stream Processing using Grammars and Regular Expressions

In this dissertation we study expression-based parsing and the use of grammatical specifications for the synthesis of fast, streaming string-processing programs. In the first part we develop two linear-time algorithms for regular-expression-based parsing with Perl-style greedy disambiguation. The first algorithm operates in two passes in a semi-streaming fashion, using a constant amount of worki...

POSIX Regular Expression Parsing with Derivatives

We adapt the POSIX policy to the setting of regular expression parsing. POSIX favors longest left-most parse trees. Compared to other policies such as greedy left-most, the POSIX policy is more intuitive but much harder to implement. Almost all POSIX implementations are buggy as observed by Kuklewicz. We show how to obtain a POSIX algorithm for the general parsing problem based on Brzozowski’s ...

Optimally Streaming Greedy Regular Expression Parsing

We study the problem of streaming regular expression parsing: Given a regular expression and an input stream of symbols, how to output a serialized syntax tree representation as an output stream during input stream processing. We show that optimally streaming regular expression parsing, outputting bits of the output as early as is semantically possible for any regular expression of size m and a...

Bit-coded Regular Expression Parsing

Regular expression parsing is the problem of producing a parse tree of a string for a given regular expression. We show that a compact bit representation of a parse tree can be produced efficiently, in time linear in the product of input string size and regular expression size, by simplifying the DFA-based parsing algorithm due to Dubé and Feeley to emit the bits of the bit representation witho...

A text pattern-matching tool based on Parsing Expression Grammars

Current text pattern-matching tools are based on regular expressions. However, pure regular expressions have proven too weak a formalism for the task: many interesting patterns either are difficult to describe or cannot be described by regular expressions. Moreover, the inherent nondeterminism of regular expressions does not fit the need to capture specific parts of a match. Motivated by these ...

Publication date: 2013